Statistical Machine Translation of Subtitles: From OpenSubtitles to TED

Authors

  • Mathias Müller
  • Martin Volk
Abstract

In this paper, we describe how the differences between subtitle corpora, OpenSubtitles and TED, influence machine translation quality. In particular, we investigate whether statistical machine translation systems built on their basis can be used interchangeably. Our results show that OpenSubtitles and TED contain very different kinds of subtitles that warrant a subclassification of the genre. In addition, we have taken a closer look at the translation of questions as a sentence type with special word order. Interestingly, we found the BLEU scores for questions to be higher than for random sentences.

DOI: https://doi.org/10.1007/978-3-642-40722-2_14

Posted at the Zurich Open Repository and Archive, University of Zurich. ZORA URL: https://doi.org/10.5167/uzh-82233 (Accepted Version)

Originally published at: Müller, Mathias; Volk, Martin (2013). Statistical machine translation of subtitles: From OpenSubtitles to TED. In: Gurevych, Iryna; Biemann, Chris; Zesch, Torsten. Language Processing and Knowledge in the Web. Berlin Heidelberg: Springer, 132-138.

Mathias Müller and Martin Volk, Institute of Computational Linguistics, Zurich, Switzerland

Similar resources

Analysis of translation model adaptation in statistical machine translation

Numerous empirical results have shown that combining data from multiple domains often improves statistical machine translation (SMT) performance. For example, if we desire to build SMT for the medical domain, it may be beneficial to augment the training data with bitext from another domain, such as parliamentary proceedings. Despite the positive results, it is not clear exactly how and where add...

Machine Translation of Film Subtitles from English to Spanish Combining a Statistical System with Rule-based Grammar

In this project we combined a statistical machine translation system for the translation of film subtitles from English to Spanish with rule-based grammar checking. At first we trained the best possible statistical machine translation system with the available training data. The largest part of the training corpus consists of freely available amateur subtitles. A smaller part are professionally...

Pre-reordering for Statistical Machine Translation of Non-fictional Subtitles

This paper describes the challenges of building a Statistical Machine Translation (SMT) system for non-fictional subtitles. Since our experiments focus on a “difficult” translation direction (i.e. French-German), we investigate several methods to improve the translation performance. We also compare our in-house SMT systems (including domain adaptation and pre-reordering techniques) to other SMT ...

Cross-lingual Sentence Compression for Subtitles

We present an approach for translating subtitles where standard time and space constraints are modeled as part of the generation of translations in a phrase-based statistical machine translation system (PBSMT). We propose and experiment with two promising strategies for jointly translating and compressing subtitles from English into Portuguese. The quality of the automatic translations is measu...

Using Linguistic Annotations in Statistical Machine Translation of Film Subtitles

Statistical Machine Translation (SMT) has been successfully employed to support translation of film subtitles. We explore the integration of Constraint Grammar corpus annotations into a Swedish–Danish subtitle SMT system in the framework of factored SMT. While the usefulness of the annotations is limited with large amounts of parallel data, we show that linguistic annotations can increase the g...

Publication date: 2013